(DONE) Your abstract assumes way too much knowledge and shouldn't quote someone else at length. Change it to be a description of the existing model, and then a description of what you did here.
(DONE) 1.1: don't say 'the original paper'. It's Schultheis, Glasmeier & Nadeau [2014]
(DONE) 3.4 'linearly impute' should be 'linearly imputing'
(DONE) 4.1 typo in Mercer County (also, County should be capitalized or plural)
(Ask For Feedback) 4.2 it still has a TODO
(DONE) 4.3 is the choropleth scale consistent for all three charts? If so, that's fantastic. (Yes, I should copy the legend over, but it's definitely the same scale.)
(DONE) 4.5 the scale for these charts is kinda hilarious. I think you should get rid of it and just depend on the title. I guess it's your choice. (same applies for figure 17)
(DONE) 4.6 you have a ref to 'table ??'.
(DONE) 4.6 Also, I have no idea what you're trying to say with the most1, most2 groups. can you improve that explanation?
(DONE) 4.6 table 1 - in the Variable names, 2nd column, you have 'most' where I think you want 'most1'
(DONE) I don't understand Figure 20: what are the two distributions?
(DONE) 6.1 - you should professionalize the language here: 'not sure why' is not good enough.
I think this needs a little bit more polishing/reviewing, but it's in a pretty good place.
Finished loading in minimum and median wage data. Will check whether there is a breakdown of wage by race by county / state. Finished creating a custom violin plot that takes weights into account, and pushed violin plots of the living wage breakdown by race. Need to fix a bug so that all races show.
Added first map plot to show county living wages for 2014. Noticed that a portion of the map is not there, which confirms that there is something wrong with "East Coast" states (I noticed this earlier in the "regional" section). I have to go back over my previous steps and debug.
Final Output TODO:
Overall / Project TODO:
Loaded minimum wage data and median wage data. For each, I need to do some data validation and confirm:
Finished up some more of the race breakdown graphs. I want to do a boxplot or violin plot of the county breakdown, but matplotlib does not do a weighted version. Taking the data and 'expanding' it based on the integer weights would work in theory, but the list ends up too large. Both types of plots can take in a custom function to return custom values for where the box is placed, etc. Might open source a solution, but I need to move forward on looking at the median and minimum wage comparisons.
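One way to avoid expanding the data by its integer weights is to compute the box statistics directly from a running sum of the weights. The sketch below is a minimal, dependency-free illustration, not the plotting code used in this notebook; the function names and the sample numbers are hypothetical. It uses the inverse-CDF definition of a quantile (the smallest value whose cumulative weight reaches q times the total weight), which gives the same answer as applying that definition to the expanded list.

```python
def weighted_quantile(values, weights, q):
    """Smallest value whose cumulative weight reaches q * total weight (0 <= q <= 1)."""
    pairs = sorted(zip(values, weights))
    target = q * sum(w for _, w in pairs)
    running = 0
    for value, weight in pairs:
        running += weight
        if running >= target:
            return value
    return pairs[-1][0]


def weighted_box_stats(values, weights):
    """Box statistics where each value counts `weight` times; whiskers span the full range."""
    return {
        "q1": weighted_quantile(values, weights, 0.25),
        "med": weighted_quantile(values, weights, 0.50),
        "q3": weighted_quantile(values, weights, 0.75),
        "whislo": min(values),
        "whishi": max(values),
        "fliers": [],
    }


# Hypothetical county living wages (dollars/hour) and county populations.
wages = [9.50, 10.25, 12.00, 14.75]
populations = [2000, 6000, 6000, 6000]
stats = weighted_box_stats(wages, populations)
print(stats)
```

The dictionary keys follow the format that matplotlib's `Axes.bxp` accepts for precomputed box statistics, so a list such as `[stats]` could be handed to it instead of calling `boxplot` on raw data.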
Over the past week I have done quite a lot more analysis of the living wage. I am now trying to move on to the living wage gap, which will have a similar analysis. Starting to download wage data now.
Overall / Project TODO:
Immediate TODO:
Added proper population-weighted averages to 2 out of 4 sections; will finish that up tomorrow. Using population data from the housing data. Cleaned up quite a few of the plots, using better colors and dashed lines so as not to imply that I am interpolating data between years.
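For reference, a population-weighted average weights each county's value by its share of the total population. A minimal sketch with made-up numbers (the function name and the values are hypothetical, not taken from the housing data):

```python
def weighted_mean(values, weights):
    """Average of values where each value counts in proportion to its weight."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical living wages (dollars/hour) and populations for two counties.
living_wage = [10.0, 14.0]
population = [1000, 3000]

simple_average = sum(living_wage) / len(living_wage)
population_weighted = weighted_mean(living_wage, population)
print(simple_average, population_weighted)
```

The larger county pulls the weighted figure toward its own value, which is the reason for preferring it over the simple average when summarizing by state or region.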
Added a national average and broke it down by model variables to see which of them increased the most from 2004 - 2014. Not surprising that it's rent. Interesting that other_costs have gone down (is this people cutting back on non-needed items to pay for increasing housing costs?).
Holiday weekend is over so back to work. Expanded on the analysis between population subgroups of counties, and added some visualizations. Will expand on the discussion, including limitations of analysis due to a lack of regional weighting for the CEX variables. Tomorrow I will move on to downloading wage data (minimum, median), calculate the gap, and do a similar analysis.
Think about population weighted averages.
Show error bars in plots?
Completed IS622 project, so now I have only this left.
Plotted some counties by hand. Working on deriving state averages, and plotting regional averages as well. Also importing a list of the most populous counties to look at differences between populous and less populous counties.
The issue with 2004 - 2005 - 2006 in the FMR data mostly affects eastern states, so I need to go back and look at what is wrong. I think county breakdowns for these states differ, and deriving FIPS codes seems to be problematic.
Spent Friday and Saturday on IS622 project. Almost done with it, will work on it at end of this week. Need to step up work on thesis now. Current plan is to:
Load in CEX data and forget regional weights for now (DONE)
Produce visualization of all model variables in appendix (DONE)
Redo 2002 fips matching
Find counties that appear in all years for housing data (DONE)
Create two final dataframes with counties as rows: (Started ...)
Produce visualization of living wage for one county (DONE)
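Finding the counties present in every year amounts to intersecting the per-year sets of FIPS codes. A minimal sketch with placeholder data (the dictionary name and the codes listed per year are made up for illustration, not the actual contents of the housing files):

```python
# Hypothetical: FIPS codes found in each year's housing file.
fips_by_year = {
    2004: {"36047", "36061", "34021"},
    2005: {"36047", "36061"},
    2006: {"36047", "36061", "34021"},
}

# Keep only the counties that show up in every single year.
common_fips = set.intersection(*fips_by_year.values())

# Counties dropped from at least one year, worth inspecting by hand.
all_fips = set.union(*fips_by_year.values())
missing_somewhere = all_fips - common_fips

print(sorted(common_fips), sorted(missing_somewhere))
```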
Created some better formatting for the document. Got the latest inflation numbers from BLS calculator.
Adjusted for inflation on food data, which is now in data frame form. Also adjusted inflation for housing data, and created the multi-level dataframe (and created more consistent column names).
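The inflation adjustment itself is a ratio of price index values: an amount from one year is scaled by the index of the target year divided by the index of the source year. A minimal sketch, assuming a yearly index table; the index values below are hypothetical placeholders, not actual BLS figures, and the function name is my own:

```python
# Hypothetical price index values by year (NOT actual CPI numbers).
cpi = {2004: 100.0, 2014: 125.0, 2015: 126.0}


def adjust_for_inflation(amount, from_year, to_year, index):
    """Express an amount from from_year in to_year dollars."""
    return amount * index[to_year] / index[from_year]


# Inflate a 2004 cost forward to 2014 dollars.
rent_2004_in_2014_dollars = adjust_for_inflation(800.0, 2004, 2014, cpi)

# The same function deflates a later year back to 2014 dollars.
cost_2015_in_2014_dollars = adjust_for_inflation(1260.0, 2015, 2014, cpi)

print(rent_2004_in_2014_dollars, cost_2015_in_2014_dollars)
```

Keeping the target year as a parameter makes it easy to first reproduce published values in one base year and later restate everything in another.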
Going through all variables to confirm that the methodology works, is adjusted for inflation and the final data is in data frame format.
TODO:
Finished loading data from housing files after exporting them all to CSV. The data for 2002 is not loaded since it does not include a FIPS column. I will check to see if an alternate download exists, but may have to match based on county name.
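A county FIPS code is the 2-digit state code followed by the 3-digit county code, both zero-padded, so it can be rebuilt when the two parts are available separately. When only names are available, a fallback is to match on a normalized (state, county name) key. The sketch below is a rough illustration of both ideas; the helper names, the suffix list, and the one-entry lookup table are assumptions for the example, and real files will likely need more normalization rules.

```python
def make_fips(state_code, county_code):
    """Build a 5-character FIPS string from numeric state and county codes."""
    return f"{int(state_code):02d}{int(county_code):03d}"


def normalize_county(name):
    """Lowercase a county name and drop a trailing 'County'-style suffix."""
    name = name.strip().lower()
    for suffix in (" county", " parish", " borough"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
            break
    return name


# Hypothetical lookup built from a year that does include FIPS codes.
name_to_fips = {("NY", normalize_county("Kings County")): "36047"}


def fips_from_name(state, county, lookup):
    """Return the FIPS code for a (state, county name) pair, or None if unmatched."""
    return lookup.get((state, normalize_county(county)))


print(make_fips(36, 47), fips_from_name("NY", "  KINGS county ", name_to_fips))
```

Storing the result as a string rather than an integer preserves the leading zero for states whose code is below 10.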
Need to clean this data up and use a multi-level index. Created a state to code mapping, as well as a state to region mapping (to do regional weighting of model variables that need it). Mapped each county to a region by adding a new column.
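The county-to-region step is a dictionary lookup on the state abbreviation. A small sketch with only a handful of states, assuming the four Census Bureau regions; the regional definitions used by the model's data sources may differ, and the record layout here is hypothetical:

```python
# Partial state -> region mapping (Census Bureau regions, a few states only).
state_to_region = {
    "NY": "Northeast",
    "NJ": "Northeast",
    "IL": "Midwest",
    "TX": "South",
    "CA": "West",
}

# Hypothetical county records; the real data has one row per county per year.
counties = [
    {"fips": "36047", "state": "NY"},
    {"fips": "17031", "state": "IL"},
    {"fips": "48201", "state": "TX"},
]

# Add the region as a new field on every county record.
for county in counties:
    county["region"] = state_to_region[county["state"]]

print(counties)
```

With the data in a pandas data frame, the equivalent would be something like `df["region"] = df["state"].map(state_to_region)`.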
Tomorrow:
I did some more work for IS622 to get some more work off my plate. Most of the hw for the rest of the semester is done, just need to work on the project. This helps free more time for continuous work going forward. I have been feeling ill for the past two to three days, so I didn't get as much work done as I would like. This holiday weekend I need to, and will, make significant progress.
Got off track again, but going forward I will have another night (Friday) to do work on this, as I dropped some extracurricular activities to make more time for this.
Updated schedule:
Started loading in housing data from FMR data. Converted XLS files to CSV, imported them into pandas, and filtered out columns we do not need. Going to figure out multi-level indexes to start storing this in a more convenient fashion (as well as other forms of data). Confirmed data matches listed data for one county, but need to confirm methodology for HUD areas (like NYC, which uses a population-weighted average).
Some success: went over the food data and I am now getting exact values. There is a note in the USDA PDFs about adding 20% to the values when looking at individuals, since their calculations are for individuals in families. This corrects the discrepancy and I am confident this portion is now officially done. This also confirms that my model should be able to reproduce the exact output of the posted model data.
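The 20% correction is a single multiplication applied to the per-person family figure. A worked example with a made-up monthly value (the 200.0 figure and the function name are hypothetical; only the 20% markup comes from the USDA note described above):

```python
INDIVIDUAL_MARKUP = 0.20  # 20% added for a person living alone, per the USDA note


def individual_food_cost(per_person_family_cost):
    """Scale a per-person cost computed for family members up to a single individual."""
    return per_person_family_cost * (1 + INDIVIDUAL_MARKUP)


# Hypothetical monthly food cost listed for one adult within a family.
monthly_cost = individual_food_cost(200.0)
annual_cost = monthly_cost * 12

print(round(monthly_cost, 2), round(annual_cost, 2))
```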
Insurance data has been downloaded and parsed into a data frame. Main problem here is data only goes back to 2006. Either need to find another data source to go back to 2001, or limit the model to 8 years (2006 - 2014).
Added some more to the outline of general steps I need to take at end of other notebook
Updated Schedule this week:
Looked at the CEX data again, but I do not see how to get the numbers to line up exactly with the model numbers I see on the county websites (e.g. http://livingwage.mit.edu/counties/36047). They are not far off, but the South and Midwest seem not to be regionally weighted correctly.
Since the 'other' variable comes from the CEX, I took a look at it as well and found the same thing as with transportation. Similar numbers, but something seems off about the South and Midwest.
Schedule this week:
tomorrow
wednesday / thur
thursday / friday
TODO - compare food and cex regional definitions (http://www.bls.gov/cex/csxgloss.htm)
TODO - the inflation numbers used are old; use their numbers to confirm model accuracy but then will have to use new numbers to scale to 2015 dollars
I got a bit delayed this week, but my plan is to finish up the Week13 and Week14 homeworks by Monday night, which will free me up for two weeks with no homework to worry about. The plan then is to spend contiguous chunks of time on the thesis and make significant progress by the end of this week.
Currently, I am in an email conversation with one of the model authors about regional weights. She has described the methodology, but I need to confirm that this works with the data (as I thought I tried what she suggested).
Downloading other data sets as I think about how to use the Consumer Expenditure Survey correctly (with respect to regional differences). Started with child care and had to manually download PDFs from ChildCareAware.org. Sadly, they only go back to 2010. I can now either:
Currently I am only focusing on modeling costs for a single adult (an assumption I made early on) since I am interested in trends, and the other 'family configurations' are just linear combinations of the costs for one adult and for one child. However if I wanted to extend the numbers for 1 adult + 1 child, I would have to look into this further. For now I'll move on.
Downloaded all the housing data, will determine what we need to extract.
Will work on the insurance component tomorrow
Tried loading in the transportation cost data from the Consumer Expenditure Survey for 2014. The data is in Excel files, which makes it OK to pull data out, but I am still unsure how the original model deals with regional differences. I emailed the model author, hoping for a response soon. The numbers I get do not line up well for all regions.
Also need to go over what year the model is for. I thought the model reports 2014 estimates, and where data is only available for 2013, we adjust for inflation to get an estimate for 2014. Need to go over the details carefully.
For my current theory of how to do regional differences, I figured out how to get the values I think I need from the aggregate files since those are the only files going back to 2001.
TODO
No progress to speak of. I got an email back from the library, with some hints to find more journals to look into. Will look into this sometime this weekend. I need to start setting deadlines for the data to be ingested, or I will never get this done. Will come up with a schedule for this weekend tomorrow night.
Sadly, due to family events and issues with my stomach, I have not been able to do much work. Starting work tonight to finish up loading of food data into a dataframe. Eventually got it done and food data is loaded, it just needs to be adjusted for inflation.
Question: for inflation, everything in the original model is in 2014 dollars since they did not do this for 2015. To test if I am getting the same values as the original model, I should inflate all past values to 2014. But what about 2015? Exclude for now? Deflate back to 2014 dollars?
Journals: Will look into some journals as well. Emailed the librarians at the Newman Library to see if they can help pinpoint journals. The Economic Policy Institute looks interesting, though clearly a bit partisan.
Using the Elsevier Journal Finder, I found the following journals that might be useful:
Some I think might be related but I do not think this paper would meet requirements or would only be tangentially related:
Most costs across counties in NY state seem to be the same, with the biggest county-level changes coming from housing. As a matter of fact, it looks like only housing changes. This means the living wage model does come up with a county-level approximation, but most of the variables are state or national level averages, with housing being the only true county-level data. In some instances, regional differences are accounted for, as with food.
Health definition: "The health component of the basic needs budget includes: (1) health insurance costs for employer sponsored plans, (2) medical services, (3) drugs, (4) medical supplies."
Costs for (2) medical services, (3) drugs and (4) medical supplies were derived from 2013 national expenditure estimates by household size provided in the 2014 Bureau of Labor Statistics Consumer Expenditure Survey. These estimates were further adjusted for regional differences using annual income expenditure shares reported by region. Values were inflated to 2014 dollars using the Consumer Price Index inflation multiplier from the Bureau of Labor Statistics.
Costs for (1) health insurance calculated using the Health Insurance Component Analytical Tool (MEPSnet/IC) provided online by the Agency for Healthcare Research and Quality
Worked on the thesis a little, but didn't have much time. Did some research on methodology and realized I can make a simplifying assumption. The model calculates living wages for 12 different sets of families, based on the number of adults and children. Since each of the 12 combinations is a linear combination of adult and child costs from the model, and I am looking for trends and correlations, I will only calculate numbers based on a single adult. This has the side benefit of removing a variable from the model, child care costs.
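Taking the linearity described above at face value, any family configuration can be written as a count of adults times the adult cost plus a count of children times the child cost. A small sketch with made-up annual costs (the function name, the dollar amounts, and the configurations shown are hypothetical and do not enumerate the model's 12 family types):

```python
def family_cost(n_adults, n_children, adult_cost, child_cost):
    """Total cost as a linear combination of per-adult and per-child costs."""
    return n_adults * adult_cost + n_children * child_cost


# Hypothetical annual costs in dollars.
ADULT_COST = 20000.0
CHILD_COST = 8000.0

single_adult = family_cost(1, 0, ADULT_COST, CHILD_COST)
two_adults_two_children = family_cost(2, 2, ADULT_COST, CHILD_COST)

print(single_adult, two_adults_two_children)
```

With no children the child term drops out entirely, which is why the single-adult case needs no child care data.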
I also looked over the methodology for the food cost variable, and the logic is simple: food costs are taken from the second-cheapest meal plan in the USDA outline, and are taken to represent a national average. Each county takes this national average and weights it by a regional factor.
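In other words, a county's food cost is the national figure multiplied by the factor for the county's region. A minimal sketch; the national cost and the regional factors below are hypothetical placeholders, not the values used by the model:

```python
# Hypothetical national annual food cost for one adult, in dollars.
NATIONAL_FOOD_COST = 3000.0

# Hypothetical regional adjustment factors (1.0 means equal to the national average).
regional_factor = {
    "Northeast": 1.10,
    "Midwest": 0.95,
    "South": 0.90,
    "West": 1.05,
}


def county_food_cost(region):
    """Scale the national food cost by the factor for the county's region."""
    return NATIONAL_FOOD_COST * regional_factor[region]


northeast_cost = county_food_cost("Northeast")
south_cost = county_food_cost("South")
print(round(northeast_cost, 2), round(south_cost, 2))
```

Every county in the same region ends up with the same food cost under this scheme, which matches the observation that only housing varies at the county level.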